Data visualization: exercises



In [2]:

    
%matplotlib inline

import matplotlib.pyplot as plt

Can you plot an histogram of word frequencies for the data/aristotle.txt file?



In [11]:

    
plt.figure(figsize=(18, 7))

words = {}
for line in open('../data/aristotle.txt'):
    for word in line.rstrip().split():
        words[word] = words.get(word, 0)
        words[word] += 1

plt.bar(range(len(words)),
        sorted(words.values()))

plt.xlabel('word')
plt.xlabel('occurrences');

Can you investigate the relationships between the variables of the data/abundance.tsv file? Which plot type is best for this task?

Can you investigate the relationships between the variables of the data/abundance.tsv file in a single figure, using subplots?



In [12]:

    
import pandas as pd



In [14]:

    
ab = pd.read_table('../data/abundance.tsv')



In [15]:

    
ab.head()









    Out[15]:






  
    
      
      target_id
      length
      eff_length
      est_counts
      tpm
    
  
  
    
      0
      ENST00000473358.1
      712
      578.499
      13.83230
      1.80247
    
    
      1
      ENST00000469289.1
      535
      341.689
      7.17452
      1.58284
    
    
      2
      ENST00000417324.1
      1187
      891.822
      0.00000
      0.00000
    
    
      3
      ENST00000461467.1
      590
      292.837
      155.16400
      39.94280
    
    
      4
      ENST00000466430.5
      2748
      2767.370
      240.02000
      6.53814



In [48]:

    
plt.figure(figsize=(12, 18))

plt.subplot(321)
plt.plot(ab['length'],
         ab['eff_length'],
         'k.')
plt.xlabel('length')
plt.ylabel('eff_length')

plt.subplot(322)
plt.plot(ab['length'],
         ab['est_counts'],
         'k.')
plt.xlabel('length')
plt.ylabel('est_counts')

plt.subplot(323)
plt.plot(ab['length'],
         ab['tpm'],
         'k.')
plt.xlabel('length')
plt.ylabel('tpm')

plt.subplot(324)
plt.plot(ab['eff_length'],
         ab['est_counts'],
         'k.')
plt.xlabel('eff_length')
plt.ylabel('est_counts')

plt.subplot(325)
plt.plot(ab['eff_length'],
         ab['tpm'],
         'k.')
plt.xlabel('eff_length')
plt.ylabel('tpm')

plt.subplot(326)
plt.plot(ab['est_counts'],
         ab['tpm'],
         'k.')
plt.xlabel('est_counts')
plt.ylabel('tpm');



In [24]:

    
import seaborn as sns



In [27]:

    
# much easier with seaborn
sns.pairplot(ab.set_index('target_id'));

Can you investigate the relationships between the variables of the data/abundance.tsv file in a single plot? You might want to use different colors...



In [30]:

    
plt.figure(figsize=(10, 10))

plt.plot(ab['length'],
         ab['eff_length'],
         '.',
         label='eff_length')

plt.plot(ab['length'],
         ab['est_counts'],
         '.',
         label='eff_length')

plt.plot(ab['length'],
         ab['tpm'],
         '.',
         label='tpm')

plt.legend(loc='best')
plt.xlabel('length')
plt.ylabel('other variable');

Can you plot the relationship between word length and number of vowels in the data/unixdict.txt file?



In [40]:

    
vowels = set('aeiouy')

dictionary1 = {}
# key: word
# value: length of the word
dictionary2 = {}
# key: word
# value: number of vowels

for line in open('../data/unixdict.txt'):
    word = line.rstrip()
    dictionary1[word] = len(word)
    dictionary2[word] = len(set(word).intersection(vowels))



In [43]:

    
plt.figure(figsize=(7, 7))

plt.plot(dictionary1.values(),
         dictionary2.values(),
         'ko')

plt.xlabel('word length')
plt.ylabel('number of vowels');

Can you plot three variables (with very different scales) in a single plot? You can google to look for an answer...



In [45]:

    
plt.figure(figsize=(7, 7))

plt.plot(ab['length'],
         ab['eff_length'],
         'k.')
plt.xlabel('length')
plt.ylabel('eff_length')
ax = plt.twinx()
ax.plot(ab['length'],
        ab['tpm'],
        'r.')
ax.set_ylabel('tpm');

Can you figure out how to make boxplots out of one of the variables of the data/abundance.tsv file?



In [55]:

    
plt.figure(figsize=(2, 7))

plt.boxplot(ab['length'])
# restrict the range of the plot
plt.ylim(0, 10000)









    Out[55]:





(0, 10000)

	target_id	length	eff_length	est_counts	tpm
0	ENST00000473358.1	712	578.499	13.83230	1.80247
1	ENST00000469289.1	535	341.689	7.17452	1.58284
2	ENST00000417324.1	1187	891.822	0.00000	0.00000
3	ENST00000461467.1	590	292.837	155.16400	39.94280
4	ENST00000466430.5	2748	2767.370	240.02000	6.53814